IQSS/8286 enable i18n search of cvv fields #8435

@qqmyers qqmyers commented Feb 17, 2022

What this PR does / why we need it: Before this change, controlled vocabulary values (CVVs) were indexed only by their names (i.e. as listed in the tsv) and not by their language-specific translations. This causes the problem described in #8286: users (in the UI or via the API) have no good way to search for these values. UI users can use the facet, but typing the language-specific term shown in the facet into the search box fails to find the same datasets. As an incremental fix, this PR leaves the facet field (e.g. subject_ss) as is but indexes all of the configured language variants in the main field (e.g. subject). Searches against the main field now work (e.g. subject:Chimie, French for Chemistry), and the same search via the API also works. Configured languages are those set as display languages (the ones the overall UI can be shown in) plus those allowed for dataset metadata (controlled by the metadata languages setting).
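As a rough illustration of the indexing idea described above (not the actual IndexServiceBean code), the sketch below collects the values that would go into the main field versus the facet field for one CVV. The class name, method names, and the in-memory translation map are all hypothetical stand-ins; the real code resolves translations through the application's own localization mechanism.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CvvIndexSketch {

    // Values for the main field (e.g. "subject"): the canonical CVV name
    // plus every translation found for the configured languages.
    // 'translations' is a hypothetical table: language code -> (canonical name -> localized name).
    static Set<String> mainFieldValues(String canonical,
                                       List<String> configuredLanguages,
                                       Map<String, Map<String, String>> translations) {
        Set<String> values = new LinkedHashSet<>();
        values.add(canonical);
        for (String lang : configuredLanguages) {
            String localized = translations.getOrDefault(lang, Map.of()).get(canonical);
            if (localized != null && !localized.isBlank()) {
                values.add(localized); // the set drops duplicates, e.g. identical en/fr terms
            }
        }
        return values;
    }

    // Values for the facet field (e.g. "subject_ss"): left untranslated,
    // so adding languages does not create extra facets.
    static Set<String> facetFieldValues(String canonical) {
        return Set.of(canonical);
    }
}
```

With a French translation configured, `mainFieldValues("Chemistry", List.of("en", "fr"), ...)` would yield both Chemistry and Chimie for the main field, while the facet field keeps only Chemistry.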

Which issue(s) this PR closes:

Closes #8286

Special notes for your reviewer:
As noted, this is incremental. The limitations are:

  • Because the facet search still runs against the facet field (the one with the _s or _ss suffix), the URL and the facet tags showing which facets are used in the current search still show the untranslated value. It makes sense to keep a facet field with untranslated values (adding translations there would add facets), but because the tags come directly from the Solr fields involved, it is complex to backtrack to the underlying field and then find which translated values exist. For example, a human knows that subject_ss probably relates to the subject field (actually, the dvSubject dataverse-level values go in there too) and may then know that this field is in the 'citation' block, but discovering that in code probably means scanning all blocks, keeping a map around, etc.
  • The _s and _ss fields are of type 'string' or 'strings' in Solr, whereas the main fields are of type text_en. The effect is that searches against the main field (e.g. subject) can return partial/close hits, whereas searches against the facet field (subject_ss) either match the full value or miss. It may make sense to set the main field for CVV fields to type string(s) anyway, since one presumably wants exact hits, but I decided against doing that in this PR a) to limit scope and b) in the hope that the dynamic schema.xml generation code that isn't yet merged will simplify making such a change (and handling new CVV fields in custom blocks, etc.). In the meantime, since most facet CVV values are unlikely to overlap much, the practical effect is probably small. (I suppose a search for 'and' could pull up several two-part facets, etc.)
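To make the string vs. text_en difference in the second bullet concrete, here is a deliberately simplified model in plain Java. It is not Solr: real text_en analysis also applies things like stemming and stop-word handling, which this sketch ignores. The class and method names are hypothetical, and the sample values are only illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class FieldMatchSketch {

    // Rough model of a 'string' field (e.g. subject_ss): the query must equal the whole value.
    static List<String> stringFieldHits(List<String> values, String query) {
        return values.stream()
                .filter(v -> v.equals(query))
                .collect(Collectors.toList());
    }

    // Rough model of a tokenized text field (e.g. subject): the value is split into
    // lowercased tokens and a single-term query matches if it equals any token.
    static List<String> textFieldHits(List<String> values, String query) {
        String term = query.toLowerCase(Locale.ROOT);
        return values.stream()
                .filter(v -> Arrays.asList(v.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                        .contains(term))
                .collect(Collectors.toList());
    }
}
```

In this model, a query for Sciences misses every value in the string field but hits both "Social Sciences" and "Earth and Environmental Sciences" in the tokenized field, which is the kind of partial hit described above.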

Suggestions on how to test this: Configure multiple display and/or metadata languages and verify that searches using the basic search box work with the translated values, and that facet search still works as before. One could also test the API call with the same queries. Sciences PO might be able to assist in setup and/or testing.
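For the API side of the test, a small helper like the following could build the Search API URLs to try. This is only a sketch: the base URL is an assumption about a local test installation, the class and method names are hypothetical, and the `q` and `fq` parameters are used as in the Search API's usual query and filter-query form.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlSketch {

    // Builds a Search API URL; filterQuery may be null when no facet filter is wanted.
    static String searchUrl(String baseUrl, String query, String filterQuery) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("/api/search?q=")
                .append(URLEncoder.encode(query, StandardCharsets.UTF_8));
        if (filterQuery != null) {
            url.append("&fq=").append(URLEncoder.encode(filterQuery, StandardCharsets.UTF_8));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080"; // assumed test installation
        // Translated term against the main field.
        System.out.println(searchUrl(base, "subject:Chimie", null));
        // Facet-style filter against the untranslated facet field.
        System.out.println(searchUrl(base, "*", "subject_ss:\"Chemistry\""));
    }
}
```

Comparing the dataset counts returned by the two URLs is one way to check that the translated term and the untranslated facet value find the same datasets.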

Does this PR introduce a user interface change? If mockups are available, please link/include them here: Only in that user-entered translated terms now produce search results.

Is there a release notes update needed for this change?:

Additional documentation:

Conflicts:
	src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java
@landreev landreev self-assigned this Mar 14, 2022
@landreev
Contributor

Since the PR appears to have some overlapping parts with #8437, is there anything special about how the 2 need to be handled? It looks like it should be safe to merge them in any order... but is it?


@landreev landreev left a comment


I had to re-read the part about "strings" (_s and _ss) vs. text_en a couple of times... But I think it all makes sense.

@qqmyers
Member Author

qqmyers commented Mar 15, 2022

I think all the PRs are independent w.r.t. merging. It has certainly been convenient to test them all together, but there shouldn't be any code dependencies.

@landreev landreev removed their assignment Mar 15, 2022
@kcondon kcondon self-assigned this Mar 21, 2022
Development

Successfully merging this pull request may close these issues.

Feature Request/Idea: Change the language of facets in the search API